Comment: Classifier Technology and the Illusion of Progress



Jerome H. Friedman 

This paper provides a valuable service by asking 
us to reflect on recent developments in classifica- 
tion methodology to ascertain how far we have pro- 
gressed and what remains to be done. The sugges- 
tion in the paper is that the field has advanced very 
little over the past ten or so years in spite of all of 
the excitement to the contrary. 

It is of course natural to become overenthusias- 
tic about new methods. Academic disciplines are as 
susceptible to fads as any other endeavor. Statis- 
tics and machine learning are not exempt from this 
phenomenon. Often a new method is heavily cham- 
pioned by its developer(s) as the "magic bullet" that 
renders past methodology obsolete. Sometimes these
arguments are accompanied by nontechnical metaphors
such as brain biology, natural selection and human
reasoning. The developers become gurus of a movement
that eventually attracts disciples who in turn
spread the word that a new dawn has emerged. All
of this enthusiasm is infectious and the new method
is adopted by practitioners who often uncritically
assume that they are realizing benefits not afforded
by previous methodology. Eventually realism sets in
as the limitations of the newer methods emerge and
they are placed in proper perspective.

Such realism is often not immediately welcomed.
Suggesting that an exciting new method may not
bring as great an improvement as initially envisioned
or that it may simply be a variation of existing
methodology expressed in new vocabulary often elicits
a strong reaction. Thus, the messengers who bring
this news tend to be, at least initially, unpopular
among their colleagues in the field. It therefore takes
courage to provide this type of service, and Professor
Hand is to be congratulated for this thoughtful
article.

Of course, simply because new methodologies are
often overhyped does not necessarily imply that they
do not, at least sometimes, represent important progress.
In the case of classification, I believe that there have
been major developments over the past ten years
that have substantially advanced the field, both in
terms of theory and practice. Although I find myself
in agreement with most of the premises of this article,
I do not see how they lead to the implication
that such advances are "largely illusionary."

There appear to be three main premises presented
in the article. First, the improvements realized by
the newer methods over the previous ones are less
than those achieved by the previous ones over their 
predecessors, presumably no methodology at all. Sec- 
ond, the evidence often presented (at least initially) 
in favor of the superiority of the newer methods is of- 
ten suspect. Finally, the newer methods do not solve 
all of the outstanding important problems that re- 
main in the field of classification. In my view these 
observations are correct and underappreciated in the 
field. The article does an important service by il- 
lustrating them so forcefully. However, the truth of 
these assertions does not imply lack of important 
progress; only that low-lying fruit is often easier to
gather, we should be more thorough concerning vali- 
dation when initially presenting new procedures and 
there is still important work to be done. 

One of the main assertions in the paper is that, in 
many applications, older methods often yield error 
rates comparable to the more modern ones. This is 
of course true and is intrinsic to the classification 
problem, especially when the metric used to mea- 
sure performance is based on error rate. First, there 
is the irreducible error caused by the fact that the 
predictor variables x often do not contain enough in- 
formation to specify a unique value for the outcome 
variable y. At best, they specify a probability distri- 
bution of possible values Pr(y|x) which is hopefully 
different for differing values of x, indicating some
predictive power. This phenomenon afflicts all prediction
problems. A second phenomenon is peculiar
to classification; it is not necessary to accurately estimate
Pr(y|x) to achieve minimal error rate. All
that is required of the estimates $\widehat{\Pr}(y|x)$ is

(1)   $\arg\max_y \widehat{\Pr}(y \mid x) = \arg\max_y \Pr(y \mid x).$

The actual values of the estimates for differing val- 
ues of y need not be close to their respective under- 
lying true values. The estimates for the nonmaxi- 
mizing probabilities need not even be in the correct 
order. Thus, more flexible (modern) procedures that 
are better able to estimate more complex probabil- 
ity structures need not produce dramatically lower 
error rates in many applications. This also accounts 
for the "flat minimum" effect discussed in the paper.
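
The consequence of (1) can be illustrated with a minimal sketch. The probabilities below are invented for a hypothetical three-class problem and are not taken from the paper; the names true_probs and est_probs are illustrative assumptions. The estimate is far from the truth and puts the nonmaximizing classes in the wrong order at every x, yet it selects the same class as the truth and so attains the same expected error rate.

```python
# Hypothetical true class probabilities Pr(y|x) at four values of x
# (illustrative numbers only, not taken from the paper).
true_probs = [
    [0.50, 0.30, 0.20],
    [0.10, 0.60, 0.30],
    [0.25, 0.35, 0.40],
    [0.70, 0.20, 0.10],
]

# A crude estimate: values far from the truth, and the two
# nonmaximizing classes are in the wrong order in every row.
est_probs = [
    [0.90, 0.02, 0.08],
    [0.45, 0.50, 0.05],
    [0.30, 0.05, 0.65],
    [0.40, 0.25, 0.35],
]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])


def expected_error(decision_probs, truth):
    """Average misclassification probability when the class is chosen by
    argmax of decision_probs and outcomes follow the distribution truth."""
    total = 0.0
    for d, t in zip(decision_probs, truth):
        total += 1.0 - t[argmax(d)]
    return total / len(truth)


bayes_error = expected_error(true_probs, true_probs)
crude_error = expected_error(est_probs, true_probs)

# Largest absolute discrepancy between estimated and true probabilities.
max_abs_gap = max(
    abs(e - t)
    for e_row, t_row in zip(est_probs, true_probs)
    for e, t in zip(e_row, t_row)
)

print(bayes_error, crude_error, max_abs_gap)
```

In this toy setting both rules have expected error 0.45, although individual probability estimates are off by as much as 0.40.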

As pointed out in the paper, classification proce- 
dures are often used in contexts where error rate 
is not the relevant quantity; functionals of Pr(y|x)
other than (1) are of interest. For example, in many
two-class classification problems, $y \in \{-1, 1\}$, the important
quantity is the rank order of $\{\Pr(y = 1 \mid x_j)\}_{j \in T}$,
where T is a set of observations with unknown
outcome. In other applications, interest is
in the actual probabilities themselves. In such set- 
tings it is likely that more accurate estimates of 
Pr(y|x) afforded by more flexible modern techniques 
will yield distinctly superior results to the older less 
flexible methods, even though their respective error 
rates are not dramatically different. The paper prop- 
erly criticizes the classification literature for present- 
ing comparisons mostly in terms of error rate, even 
though this is the criterion used for nearly all of the 
classification comparisons presented in the paper. 
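
The distinction between error rate and rank order can be sketched as follows. The labels and scores are invented for illustration, and the pairwise concordance measure used here (the fraction of positive/negative pairs ordered correctly) is one common way to quantify rank order, chosen as an assumption rather than taken from the paper.

```python
# Hypothetical outcomes y in {-1, 1} and two sets of estimated
# Pr(y = 1 | x); all numbers are illustrative.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
scores_a = [0.90, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10]
scores_b = [0.55, 0.51, 0.52, 0.05, 0.95, 0.45, 0.40, 0.48]


def error_rate(scores, labels, threshold=0.5):
    """Fraction misclassified when predicting y = 1 for scores above threshold."""
    wrong = sum(
        1 for s, y in zip(scores, labels)
        if (1 if s > threshold else -1) != y
    )
    return wrong / len(labels)


def concordance(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive
    observation receives the higher score; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, error_rate(scores, labels), concordance(scores, labels))
```

Both sets of scores misclassify 2 of the 8 observations at the 0.5 threshold, but the first orders 15 of the 16 positive/negative pairs correctly and the second only 9.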

The primary evidence intended to suggest lack of 
progress is the comparisons presented in Table 1. 
Here the error rate of an older method, linear dis- 
criminant analysis (LDA), is compared with that of 
the current best method for each of a selected set 
of problems. In spite of the general insensitivity of 
error rate as a differentiating criterion (as discussed
above), LDA seems to produce distinctly inferior re- 
sults in many of these problems. In more than half of 
the examples, its error rate is at least 45% greater; 
in one example, it is nearly six times as great. Of 
course there is a selection bias of unknown magni- 
tude in choosing the best method, but it is difficult 
to conclude from the evidence presented that LDA 
is competitive with the best current methods, even
in terms of error rate. The paper suggests that large
ratios in small error rates "will correspond to only 
a small proportion of new data points." This is true 
but not relevant. If a zip code classifier makes twice 
as many errors, it costs the post office twice as much 
to handle the misdirected mail. I have yet to see a 
problem where costs are proportional to the Prop 
linear statistic shown in the last column of Table 1. 

The paper presents a regression example (Section 2.1) 
to illustrate that including additional predictor vari- 
ables that are highly correlated with those that are 
already part of the analysis produces little gain in 
performance. This is true of all methods, old and 
new, and no evidence is provided to suggest that 
older methods are better able to incorporate addi- 
tional information from such variables. 
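
A minimal simulation sketch of this point, using ordinary least squares only: the data-generating model, sample size, noise levels and seed below are all assumptions made for illustration and are not the example of Section 2.1. It compares the in-sample R-squared of a fit on x1 alone with a fit on x1 plus a near-duplicate x2.

```python
import random

random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
# x2 is almost a copy of x1, so the two are very highly correlated.
x2 = [a + 0.1 * random.gauss(0, 1) for a in x1]
# The outcome depends on x1 plus independent noise.
y = [a + random.gauss(0, 1) for a in x1]


def center(v):
    """Subtract the mean, which removes the need for an intercept term."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x1c, x2c, yc = center(x1), center(x2), center(y)
total_ss = dot(yc, yc)

# One predictor: least-squares slope on centered data.
b = dot(x1c, yc) / dot(x1c, x1c)
r2_one = 1 - sum((t - b * a) ** 2 for a, t in zip(x1c, yc)) / total_ss

# Two predictors: solve the 2x2 normal equations by Cramer's rule.
s11, s12, s22 = dot(x1c, x1c), dot(x1c, x2c), dot(x2c, x2c)
s1y, s2y = dot(x1c, yc), dot(x2c, yc)
det = s11 * s22 - s12 * s12
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
r2_both = 1 - sum(
    (t - b1 * a - b2 * c) ** 2 for a, c, t in zip(x1c, x2c, yc)
) / total_ss

print(r2_one, r2_both, r2_both - r2_one)
```

The in-sample R-squared cannot decrease when a predictor is added, so the quantity of interest is the size of the gain, which should be small in a setup like this one.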

A second principal premise of the paper is that 
the evidence for the superiority of new methods is 
generally based on empirical comparisons which are 
susceptible to major weaknesses that place their va- 
lidity in question. I could not be in more agreement 
with this point. Section 5 of the paper should be re- 
quired reading for all practitioners and researchers 
in the field. In my data mining course, I have a 
lecture called "comparison caution" that addresses 
many of the same issues. Empirical comparisons should 
be viewed with skepticism, especially when the au- 
thors' new method is one of the competitors. Even 
when this is not the case, the authors performing the 
study often have a favorite technique which usually 
emerges as the top performer. When interpreting 
such studies, I tend to ignore the apparent top per- 
former and look at the relative rankings of the other 
methods, presuming that the authors have less ex- 
pertise and vested interest in them. Even when a 
comparison is free of all of the biases discussed in 
Section 5, its results should not be extrapolated be- 
yond the specifics of the problem represented by the 
data set being used. All methods have particular 
problems for which they are especially well suited 
and others for which they are not. Sometimes only 
a minor change in the problem setup can produce 
substantial changes in performance rankings. Re- 
sults of empirical comparisons can be useful, espe- 
cially when aggregated over time, but the natural 
tendency to overinterpret individual studies should 
be avoided. Of course, the same caution should be 
applied to the empirical comparisons presented in 
this paper. 

Simply because the initial evidence for the superiority
of a method can be questioned does not necessarily
imply that it is not useful or that it does not
represent progress. Practitioners try various methods
and, as time evolves, some emerge as being more
useful than others. Many of the "new" proposals
of the distant past have not survived the test of
time and are now long forgotten. Those that have
emerged as being generally useful, such as logistic
regression, LDA and decision trees, have survived
to see common use. No one is claiming that all of
the new techniques proposed in the literature over
the past ten years represent major advances. However,
I believe that a body of evidence is emerging that 
suggests that some of them, such as the ensemble 
methods (bagging and boosting) and support vec- 
tor machines, offer substantial advantages over the 
earlier methods in enough situations to be regarded 
as major advances. This is especially the case in sci- 
entific and engineering applications, where decision 
boundaries are often complex and far from being 
linear. 

Another major premise of the paper is that there 
are important issues that affect classification per- 
formance that are not addressed by most modern 
methodology. These include population drift, sam- 
ple selectivity bias, errors in class labels and arbi- 
trariness in class definitions. Again I could not agree 
more. Issues of nonrepresentative training data tend 
to be overlooked by the academic community, al- 
though they are probably well known to most prac- 
titioners. (See [3]. I spend several lectures in my 
data mining course covering these topics.) Obtaining 
high-quality representative training data is generally 
more important to success than choice of a particu- 
lar classifier, although given such data, choosing the 
best classifier can often provide considerable addi- 
tional benefit. In many data mining applications, 
the data were collected for a different purpose than 
solving the current problem and one does not have 
influence over its quality or value. The analyst is 
forced to do the best that can be done with the 
data at hand. 

The problem of training data being different from 
future data to be predicted is common to all predic- 
tion, not just classification. The fundamental issue 
is similar whether the differences arise through ran- 
dom sampling from a static population or are caused 
by one of the more deterministic mechanisms cited 
in the paper. As noted in the paper, the antidote is 
to limit reliance on the training data by not fitting 
it too closely. This is the basic principle underlying
regularization. The paper argues that older methods
are "simpler," thereby inducing more regularization,
which in turn causes them to be more resistant to
these types of problems. This need not be the case.

Almost all of the modern procedures incorporate 
a regularization parameter that controls the degree 
to which they are allowed to fit the training data. By 
adjusting the value of this parameter, one can pro- 
duce a sequence of models of increasing complexity 
from the very simplest that makes the same predic- 
tion everywhere to highly complex functions that 
capture the fine details of the predictive relation- 
ship as reflected in the training data. Highly regu- 
larized versions of different procedures may capture 
somewhat different aspects of the gross features of 
the probability distribution, but in the absence of 
knowledge concerning the nature of the population 
drift, there is no a priori reason to suspect that one is 
better than the other. An important consequence of 
the presence of population drift and related prob- 
lems is that model selection based on traditional 
techniques such as bootstrapping or cross-validation 
becomes overly optimistic; they will tend to produce 
insufficient regularization. Thus, care must be taken 
to regularize more heavily than suggested by these 
model selection techniques when such problems are 
suspected. 
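
The role of a regularization parameter can be sketched with the simplest possible case, a ridge penalty on a single-predictor linear fit. This stands in for the regularization parameters of the modern procedures discussed above and is an assumption made for illustration; the data and the penalty values are invented. It shows only the sequence of models from constant prediction to the unpenalized fit, not the effect of population drift on model selection.

```python
# Illustrative data; x has mean zero, so only the slope is penalized.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.0, -1.0, 0.5, 1.0, 4.0]

y_mean = sum(y) / len(y)
sxx = sum(a * a for a in x)
sxy = sum(a * (t - y_mean) for a, t in zip(x, y))


def ridge_slope(lam):
    """Slope minimizing sum (y - y_mean - b*x)^2 + lam * b^2."""
    return sxy / (sxx + lam)


def training_mse(lam):
    """Mean squared error on the training data for penalty lam."""
    b = ridge_slope(lam)
    return sum((t - y_mean - b * a) ** 2 for a, t in zip(x, y)) / len(y)


# From very heavy regularization (essentially the same prediction
# everywhere) down to none (ordinary least squares).
lambdas = [1e6, 100.0, 10.0, 1.0, 0.0]
path = [(lam, ridge_slope(lam), training_mse(lam)) for lam in lambdas]

for lam, slope, mse in path:
    print(lam, slope, mse)
```

As the penalty decreases, the slope moves from essentially zero to the least-squares value of 1.6 and the training error falls monotonically. That monotone fall is why training fit alone cannot be used to choose the amount of regularization.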

Most older classification methods limit the degree
to which one can control the amount of regularization.
It is not clear that the amount arbitrarily applied
by these procedures is necessarily appropriate
in any particular problem. In fact there are many situations
in which older methods provide insufficient
regularization. This is especially the case in modern
analytical chemistry and bioinformatics applications,
where there are many more predictor variables
than training observations, and simple logistic
regression and LDA completely fail. There has
been considerable recent research that has led to
modern classification methods that allow the application
of more regularization than the older traditional
methods. These, in my view, also represent
major progress.
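
One way to see why LDA fails when predictors outnumber observations is that the pooled within-class covariance matrix, which LDA must invert, becomes singular. The sketch below uses a tiny invented data set (four observations, three predictors, two classes) and exact rational arithmetic; all values and names are illustrative assumptions.

```python
from fractions import Fraction

# Two observations per class, three predictor variables (invented values).
class_a = [(1, 2, 0), (3, 1, 4)]
class_b = [(0, 5, 2), (2, 2, 7)]
p = 3


def scatter(rows):
    """Within-class sum of squares and cross-products about the class mean."""
    n = len(rows)
    mean = [Fraction(sum(r[j] for r in rows), n) for j in range(p)]
    s = [[Fraction(0)] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


sa, sb = scatter(class_a), scatter(class_b)
n_total = len(class_a) + len(class_b)

# Pooled covariance: summed scatter divided by (n - number of classes).
pooled = [
    [(sa[i][j] + sb[i][j]) / (n_total - 2) for j in range(p)]
    for i in range(p)
]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


pooled_det = det3(pooled)
print(pooled_det)
```

With four observations in two classes the pooled matrix has rank at most two, so its determinant is exactly zero and the discriminant directions are not defined without some additional form of regularization.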

Errors in class labels are a classic robustness issue.
Estimation in the presence of badly measured out- 
comes has been extensively studied in the regression 
literature, but less so in classification. As in regres- 
sion, the solution is to employ loss criteria that are 
less sensitive to individual extreme measurements. It 
has been suggested that logistic likelihood and the 
support vector machine hinge loss are more robust 
to misspecification of class labels than squared-error 
loss or, especially, the exponential loss associated 
with AdaBoost, since they weight realized outcomes
of low estimated probability less heavily. Even more 
robust (nonconvex) loss criteria have been proposed 
for classification (see [1, 2]). Some older methods 
such as logistic regression should be fairly robust 
to mislabeling, but others like LDA are likely to ex- 
hibit poor robustness properties; estimates of the 
pooled covariance matrix can be highly distorted by 
only a few mislabeled observations, especially at the 
extremes of the data distribution. 
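
The relative sensitivity of these loss criteria can be sketched by evaluating each as a function of the margin m = y f(x), with y in {-1, 1}. The parametrizations below are the commonly used forms and are an assumption here (other scalings of the logistic loss appear in the literature); the margin value is chosen only to represent an observation that the model confidently places in the other class, as a mislabeled point would be.

```python
import math


def exponential_loss(m):
    """Exponential loss associated with AdaBoost."""
    return math.exp(-m)


def squared_error_loss(m):
    """(y - f)^2 rewritten in terms of the margin, since y^2 = 1."""
    return (1.0 - m) ** 2


def logistic_loss(m):
    """Negative log-likelihood of the logistic model."""
    return math.log(1.0 + math.exp(-m))


def hinge_loss(m):
    """Support vector machine hinge loss."""
    return max(0.0, 1.0 - m)


# A large negative margin: the fitted model strongly disagrees with the label.
margin = -5.0
losses = {
    "exponential": exponential_loss(margin),
    "squared error": squared_error_loss(margin),
    "hinge": hinge_loss(margin),
    "logistic": logistic_loss(margin),
}

for name, value in losses.items():
    print(name, round(value, 3))
```

For large negative margins the logistic and hinge losses grow only linearly, squared error grows quadratically and the exponential loss grows exponentially, so a single badly mislabeled observation carries far more weight under the latter two.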

The problem of arbitrariness of class labels is of- 
ten caused by trying to make the problem conform 
to the method rather than the other way around. If 
an outcome variable realizes continuous numeric val- 
ues, then it should be treated as such and regression 
rather than classification technology would be more 
appropriate. There have been recent important ad- 
vances in regression methodology that parallel those 
in classification. If thresholding numeric variables to 
create a classification problem happens to be appro- 
priate and the class labels have changed, then, as the 
paper suggests, one can simply retrain the classifier 
with the new definitions. This requires that the orig- 
inal raw data be saved. Given the very low cost of 
storage media, this should always be encouraged for 
a wide variety of reasons. 

Recent research has not solved all of the outstand- 
ing problems in the field of classification, especially 
those associated with nonrepresentative training data. 
All procedures are vulnerable to these effects and, as 
discussed above, it is not clear that the older meth- 
ods enjoy more immunity than the more recent ones. 
Also, these problems are more prevalent in the com- 
mercial sector involving financial and consumer be- 
havior applications than in scientific and engineering 
fields where the laws governing the systems under 
study tend to be more stable. Nevertheless, solu- 
tions to these problems would also represent major 
advances. The paper does an important service by 
directing our attention to them, but this does not 
imply that there has not been substantial progress 
in other important aspects of the classification prob- 
lem in the recent past. 

Whether or not a new method represents impor- 
tant progress is, at least initially, a value judge- 
ment upon which people can agree to disagree. Ini- 
tial hype can be misleading and only with the pas- 
sage of time can such controversies be resolved. It 
may well be too soon to draw conclusions concern- 
ing the precise value of recent developments, but to 
conclude that they represent very little progress is at
best premature and, in my view, contrary to present
evidence. 

I thank Professor Hand for this thoughtfully provoca- 
tive article. It gives all of us an opportunity to look 
past our enthusiasm and take a deeper look at the 
remaining central issues. I look forward to research 
that produces solutions to these outstanding prob- 
lems and to future discussions as to whether they 
represent major progress. Finally, I would like to 
add another relevant quote to that of Eric Hoffer 
mentioned in the article. This one is attributed to 
Yogi Berra: "Prediction is difficult, especially when 
it's about the future." 